[LEADS-488] Adding support for multiple tags by xmican10 · Pull Request #278 · lightspeed-core/lightspeed-evaluation

xmican10 · 2026-07-01T13:46:59Z

Description

Support multiple tags, backward compatibility must be ensured.

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

Related Issue #
Closes #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

New Features
- Tags can now contain multiple values across evaluation data and results.
- Statistics and filtering now account for multi-tag results (a result is counted in each matching tag bucket).
Bug Fixes
- Tag inputs are normalized and validated; empty or whitespace-only entries are rejected.
- Conversation tag matching now uses “any tag matches” semantics.
- Error and skipped results now consistently preserve conversation tags, and tag storage/serialization supports multi-tags.
Tests
- Updated and expanded unit tests for multi-tag normalization, statistics, filtering, and storage behavior.

coderabbitai · 2026-07-01T13:47:15Z

Walkthrough

Tag values now normalize to sets of strings, and downstream statistics, storage, filtering, and error-result creation handle multiple tags per record. Tests and configuration docs were updated to reflect the new tag shape and matching behavior.

Changes

Multi-tag support

Layer / File(s)	Summary
Tag schema and normalization `src/lightspeed_evaluation/core/models/data.py`, `docs/configuration.md`	`EvaluationData.tag` and `EvaluationResult.tag` now use `set[str]`, with validators normalizing string or list inputs and rejecting empty or invalid values; the CSV tag description now documents string-or-list input and JSON-array storage.
Tag serialization and persistence `src/lightspeed_evaluation/core/output/serializers.py`, `src/lightspeed_evaluation/core/storage/sql_storage.py`, `src/lightspeed_evaluation/core/storage/langfuse_storage.py`, `tests/unit/core/storage/test_sql_storage.py`	Tags are serialized in sorted order, stored as JSON in SQL, the SQL column widens to `Text`, Langfuse import detection switches to `importlib.util`, and storage tests expect set-based tags and JSON output.
Tag grouping and scope matching `src/lightspeed_evaluation/core/output/statistics.py`, `src/lightspeed_evaluation/core/system/validator.py`, `tests/unit/core/output/test_statistics_detailed.py`, `tests/unit/core/system/test_validator.py`	Tag statistics now bucket each result under every tag it carries, and scope filtering matches when any tag overlaps the requested set; tests cover single-tag, multi-tag, and OR-style matching cases.
Error result tag propagation `src/lightspeed_evaluation/pipeline/evaluation/errors.py`, `src/lightspeed_evaluation/pipeline/evaluation/processor.py`, `tests/unit/pipeline/evaluation/conftest.py`	Error result creation APIs require `tag: set[str]`, conversation processing passes conversation tags into invalid-metric errors, and the error-handler test fixture constructs tagged results from the new argument.
Model validation tests `tests/unit/core/models/test_data.py`	`EvaluationData` and `EvaluationResult` tests now assert set-based defaults, normalization from string and list inputs, multi-tag acceptance, and rejection of empty or invalid tag values.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant ConversationProcessor
  participant EvaluationErrorHandler
  participant EvaluationResult

  ConversationProcessor->>EvaluationErrorHandler: create_error_result(..., tag=conv_data.tag)
  EvaluationErrorHandler->>EvaluationResult: _create_result(..., tag=set[str])
  EvaluationResult-->>ConversationProcessor: EvaluationResult with tag set

Possibly related PRs

lightspeed-core/lightspeed-evaluation#209: Updates tag-aware statistics flows that depend on multi-tag EvaluationResult grouping.
lightspeed-core/lightspeed-evaluation#233: Touches the same tag aggregation path in core/output/statistics.py.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 95.83% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: support for multiple tags.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (3)

tests/unit/core/storage/test_sql_storage.py (1)
269-313: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add assertion for tag JSON serialization.

test_all_fields_stored sets tag=["integration"] but never asserts on row.tag, so the new json.dumps(result.tag) behavior in sql_storage.py isn't actually verified here (unlike expected_response, which has its own test_list_serialization). As per coding guidelines, "Add comprehensive pytest coverage for new features, including mocked LLM calls where applicable."
✅ Proposed assertion addition
         assert row is not None
         assert row.conversation_group_id == "conv_full"
         assert row.score == 0.92
         assert row.evaluation_latency == 2.5
         assert row.tool_calls == '[{"name": "search"}]'
+        assert row.tag == '["integration"]'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/storage/test_sql_storage.py` around lines 269 - 313, The SQL
storage test covers full-field persistence but does not verify JSON
serialization for the tag field, so the new json.dumps(result.tag) behavior in
SQLStorageBackend.save_result is untested. Update test_all_fields_stored in
test_sql_storage.py to assert the persisted row.tag value matches the serialized
tag list, using the existing result setup and row fetched from
evaluation_results.
Source: Coding guidelines
tests/unit/core/output/test_statistics_detailed.py (1)
317-373: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Missing coverage for a single result with multiple tags.

All updated cases use single-element tag lists. None exercises the new multi-bucket behavior itself, i.e. a result whose tag contains more than one value (e.g., tag=["production", "staging"]) contributing to multiple by_tag groups simultaneously. Consider adding a case for that scenario to directly validate the for t in r.tag iteration added in compute_tag_stats.
✅ Example additional test case
def test_compute_detailed_stats_by_multiple_tags_single_result(self) -> None:
    """Test a single result with multiple tags contributes to each tag group."""
    results = [
        EvaluationResult(
            conversation_group_id="conv1",
            tag=["production", "staging"],
            turn_id="turn1",
            metric_identifier="metric1",
            result="PASS",
            score=0.9,
            threshold=0.7,
            reason="Good",
        ),
    ]

    stats = compute_detailed_stats(results).model_dump()

    assert "production" in stats["by_tag"]
    assert "staging" in stats["by_tag"]
    assert stats["by_tag"]["production"]["passed"] == 1
    assert stats["by_tag"]["staging"]["passed"] == 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/output/test_statistics_detailed.py` around lines 317 - 373,
The detailed stats tests still miss the new multi-tag bucket behavior. Add a
focused test alongside test_compute_detailed_stats_by_tag that uses a single
EvaluationResult with tag containing multiple values (for example, both
production and staging) and assert compute_detailed_stats()/compute_tag_stats
places that one result into each corresponding by_tag entry. Verify both tag
groups show the expected passed/failed counts so the for t in r.tag logic is
directly covered.
src/lightspeed_evaluation/core/models/data.py (1)
589-595: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Avoid exposing validator helpers as public APIs without Google-style docs.

Either rename these validator methods to _normalize_tag or expand the docstrings to Google style. As per coding guidelines, src/lightspeed_evaluation/**/*.py: all public APIs must use Google-style docstrings.

Also applies to: 725-731
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/models/data.py` around lines 589 - 595, The
validator helper methods normalize_tag and the similar validator in the
referenced block are being exposed as public APIs without Google-style docs.
Rename these helpers to private names like _normalize_tag, or update their
docstrings to Google-style format so they comply with the project’s public API
documentation rules while keeping the field_validator behavior unchanged.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/models/data.py`:
- Around line 523-527: Preserve non-empty tag validation in the tag
normalization path: `Field(min_length=1)` on `tag` only enforces list size, so
empty strings can still slip through after `EvaluationData.normalize_tag` and
`EvaluationResult.normalize_tag` convert inputs to `list[str]`. Update those
normalization methods to reject or filter out empty/blank items, and add an
item-level string constraint on `tag` so each element must be non-empty.

In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 222-226: The tag grouping in the statistics aggregation is
counting duplicate tags from EvaluationResult.tag more than once. Update the
grouping logic in the statistics code that builds grouped from results so it
iterates over unique tags per result (for example by deduplicating r.tag before
appending), or normalize tags earlier in the EvaluationResult flow to prevent
double counting in tag stats.

In `@tests/unit/core/models/test_data.py`:
- Around line 457-464: The single-string normalization tests are not exercising
the actual compatibility path because
`test_single_string_tag_normalized_to_list` passes a list instead of a string.
Update the `EvaluationData` and `EvaluationResult` coverage in these tests to
pass `tag="basic"` so the `mode="before"` string-to-list normalization is
verified, and keep the assertions checking the normalized list output.

---

Nitpick comments:
In `@src/lightspeed_evaluation/core/models/data.py`:
- Around line 589-595: The validator helper methods normalize_tag and the
similar validator in the referenced block are being exposed as public APIs
without Google-style docs. Rename these helpers to private names like
_normalize_tag, or update their docstrings to Google-style format so they comply
with the project’s public API documentation rules while keeping the
field_validator behavior unchanged.

In `@tests/unit/core/output/test_statistics_detailed.py`:
- Around line 317-373: The detailed stats tests still miss the new multi-tag
bucket behavior. Add a focused test alongside test_compute_detailed_stats_by_tag
that uses a single EvaluationResult with tag containing multiple values (for
example, both production and staging) and assert
compute_detailed_stats()/compute_tag_stats places that one result into each
corresponding by_tag entry. Verify both tag groups show the expected
passed/failed counts so the for t in r.tag logic is directly covered.

In `@tests/unit/core/storage/test_sql_storage.py`:
- Around line 269-313: The SQL storage test covers full-field persistence but
does not verify JSON serialization for the tag field, so the new
json.dumps(result.tag) behavior in SQLStorageBackend.save_result is untested.
Update test_all_fields_stored in test_sql_storage.py to assert the persisted
row.tag value matches the serialized tag list, using the existing result setup
and row fetched from evaluation_results.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c4a50ffd-c5dc-4b52-b6f4-ae1a51dde01a

📥 Commits

Reviewing files that changed from the base of the PR and between acb1fec and 6191049.

📒 Files selected for processing (12)

src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/storage/langfuse_storage.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/core/system/validator.py
src/lightspeed_evaluation/pipeline/evaluation/errors.py
src/lightspeed_evaluation/pipeline/evaluation/processor.py
tests/unit/core/models/test_data.py
tests/unit/core/output/test_statistics_detailed.py
tests/unit/core/storage/test_sql_storage.py
tests/unit/core/system/test_validator.py
tests/unit/pipeline/evaluation/conftest.py

asamal4

Thanks !! Some minor change..

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/models/data.py`:
- Around line 15-24: The _normalize_tag helper currently treats any non-string
input as iterable, which allows unsupported container types and can raise raw
TypeError for scalars. Update _normalize_tag in data.py to explicitly accept
only str or list of str before assigning items, and reject everything else with
a clear ValueError; keep the existing invalid-item check and normalization logic
unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23ec3c09-6ecd-4702-b50a-0b74d51b93cc

📥 Commits

Reviewing files that changed from the base of the PR and between 8a9c46a and 6eb0b91.

📒 Files selected for processing (14)

docs/configuration.md
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/output/serializers.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/storage/langfuse_storage.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/core/system/validator.py
src/lightspeed_evaluation/pipeline/evaluation/errors.py
src/lightspeed_evaluation/pipeline/evaluation/processor.py
tests/unit/core/models/test_data.py
tests/unit/core/output/test_statistics_detailed.py
tests/unit/core/storage/test_sql_storage.py
tests/unit/core/system/test_validator.py
tests/unit/pipeline/evaluation/conftest.py

✅ Files skipped from review due to trivial changes (1)

docs/configuration.md

🚧 Files skipped from review as they are similar to previous changes (8)

src/lightspeed_evaluation/core/output/statistics.py
tests/unit/pipeline/evaluation/conftest.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/core/system/validator.py
src/lightspeed_evaluation/pipeline/evaluation/errors.py
tests/unit/core/storage/test_sql_storage.py
src/lightspeed_evaluation/pipeline/evaluation/processor.py
src/lightspeed_evaluation/core/storage/langfuse_storage.py

asamal4 · 2026-07-03T12:10:04Z

 from __future__ import annotations

-import importlib
+import importlib.util


why this change ?

This change adresses this pyright warning: warning: "util" is not a known attribute of module "importlib"

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/core/models/data.py Outdated

Comment thread src/lightspeed_evaluation/core/output/statistics.py

Comment thread tests/unit/core/models/test_data.py Outdated

xmican10 force-pushed the LEADS-488-multiple-tags-support branch 3 times, most recently from 8a9c46a to 2df3380 Compare July 1, 2026 15:21

asamal4 reviewed Jul 3, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/core/models/data.py Outdated

xmican10 force-pushed the LEADS-488-multiple-tags-support branch from 2df3380 to 6eb0b91 Compare July 3, 2026 09:49

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/core/models/data.py

Adding support for multiple tags

c029567

xmican10 force-pushed the LEADS-488-multiple-tags-support branch from 6eb0b91 to c029567 Compare July 3, 2026 10:12

xmican10 changed the title ~~Adding support for multiple tags~~ [LEADS-488] Adding support for multiple tags Jul 3, 2026

asamal4 reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[LEADS-488] Adding support for multiple tags#278

[LEADS-488] Adding support for multiple tags#278
xmican10 wants to merge 1 commit into
lightspeed-core:mainfrom
xmican10:LEADS-488-multiple-tags-support

xmican10 commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

asamal4 left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

asamal4 Jul 3, 2026

Uh oh!

xmican10 Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xmican10 commented Jul 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Tools used to create PR

Related Tickets & Documents

Checklist before requesting a review

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

asamal4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

asamal4 Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

xmican10 Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xmican10 commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading